Author
|
Topic: ZCT SYMPTOMATICS
|
Poly761 Member
|
posted 07-30-2006 10:19 PM
Reviewed ZCT information (8-'05) where it was reported Krapohl indicated symptomatics were not needed in a ZCT. A response to this post reported some argue the test is (not) valid if they are removed or replaced with neutrals. Anyone aware of any changes in the ZCT involving symptomatics?

Taylor Member posted 08-26-2005 08:56 AM
--------------------------------------------------------------------------------
I attended a session by Krapohl at the APA training seminar in Albuquerque a few years ago. I recall, and if I am wrong please someone correct me, that he indicated symptomatics were not needed in a ZCT. In fact, I believe he felt it brought issues to the exam that were best left outside the polygraph. Taylor IP: Logged

Barry C Member posted 08-26-2005 09:00 AM
--------------------------------------------------------------------------------
Yes, he is one of many, as I pointed out in the new thread, who takes that position. However, some try to argue that a test is no longer "valid" if you remove them or replace them with neutrals, something I disagree with.
IP: Logged |
Barry C Member
|
posted 07-31-2006 05:34 AM
What do you mean by "changes"? All the tests are still the same, which means those that use symptomatics still call for them. The Utah tests don't use them, but instead use an "introductory buffer" question for question one (if you use one at all), which reads something along the lines of, "Do you understand I will only ask the questions we have reviewed?" After that, it's ignored. A reaction or lack of reaction to it is meaningless. It's just there to focus the examinee on the fact that the test is underway. IP: Logged |
rnelson Member
|
posted 07-31-2006 09:56 AM
This came up at the recent APA conference, in a somewhat humorous exchange when Donnie Dutton (presenter) put Mr. Krapohl (in the back of the room) on the spot. Characteristically, Mr. Krapohl stuck to the facts and offered practical suggestions. To paraphrase: Mr. Krapohl said that there is no evidence that symptomatic questions work. (I believe he or someone else also indicated they may negatively affect the scores of truthful subjects - perhaps someone else could help clarify this.) Mr. Krapohl, as a program manager, also reminded us that adherence to procedures is not unimportant, and that if we conduct examinations that are subject to quality control (assuming that quality-control officers will look for procedural adherence) then we had better use our techniques properly, as established, including a symptomatic question when required by standardized procedure - or the test may not be supported. (Like in graduate school - write for your audience.)

This is another good example of the workings of science, and the developmental course of a profession. Assumptions in science are often referred to as hypotheses, and it is the researcher's job to test a hypothesis. The accepted method for doing this is the scientific method, involving experiments and measurements that reject (at a statistically significant level of confidence) the null hypothesis. Only through rejecting the null hypothesis (through repeated or replicated research attempts) can we accept the hypothesis. So Mr. Krapohl's statement "no evidence to support" simply means that research has been unable to reject the null hypothesis (probably some version of "symptomatic questions make no difference"). Inability to reject the null hypothesis does not prove the null hypothesis - it does not prove they make no difference - it simply fails to prove they do. There may also be evidence of some minor iatrogenic effects with truthful subjects.

Only engineers and apologists set out to prove their assumptions correct. Engineers are basically applied scientists. Apologists have reached their conclusions before engaging in research, and are unwilling to be wrong.

40-some years ago, when dinosaurs roamed the primordial ooze that we call earth, research did not exist to support, define, or quantify all of the aspects and challenges that encompass the modern polygraph test. However, the absence of research does not mean that test designers (and policy makers) do not act and make decisions. Does anyone think there was research to support the effectiveness of the first traffic lights or crosswalks? Or would anyone claim that all of our "Megan's Laws" or "DARE" programs were supported in advance by research data? Megan's laws are still controversial, but early research seemed to establish a correlation not with reduced recidivism, but with shortened time to apprehension. There was also an observed increase in violence upon reoffense, and that was speculated to be associated with efforts to avoid apprehension. Few would dispute the need for traffic lights. However, crosswalks alone do not generally appear to be associated with decreased pedestrian/vehicle accidents. DARE programs, which have enjoyed well-developed delivery mechanisms in school and community policing initiatives, do not appear to reduce illegal drug use, and, like many other resistance and refusal education programs, there is evidence of iatrogenic effects - meaning there is some correlation with increased drug use.
(Actually, iatrogenic refers to unanticipated negative effects - like the pain meds Vioxx and Celebrex that have been correlated with increased cardiac risk.) Nevertheless, policy makers must continue to make policy. In the absence of research data, they will continue to make policy based upon theoretical assumptions. The smartest thing to do is to subsequently study the efficacy of those assumptions and decisions, and adjust policies accordingly. It would be arrogant and unscientific of us to neglect to study our assumptions, or to neglect to update our field practice guidelines. Some changes are occurring in polygraph. DoDPI has a new scoring system (very much like the Utah system) that is empirically derived and will improve interrater reliability. r ------------------ "Gentlemen, you can't fight in here, this is the war room." --(Dr. Strangelove, 1964) IP: Logged |
Barry C Member
|
posted 07-31-2006 10:40 AM
Has DoDPI finally accepted the new scoring system? It's been in the works for a while.

As far as the evidence goes for symptomatics, the Honts & Amato study showed examiners couldn't use them to identify outside issues at greater-than-chance rates, and they did lower the scores of the truthful - a problem when you realize most scorers are biased against the truthful to begin with. No other studies show they do what they are supposed to do. The question that arises now, however, is, "Do examiners invite an outside issue by bringing it up when introducing the symptomatic questions?"

This goes back to how one argues "validity." Is a particular technique "valid," or are principles validated through research? If you hold only to the former, then you're stuck - more so than some will admit. If you deviate from the CQs used in the research studies (and you do with almost every test you run), then you have to ask if your test is "valid." If, however, you hold to running a test in which you put together all the "validated" principles (that actually work) and leave out what doesn't, then you should have the optimal test, which is how we got the Utah tests and why the DoDPI tests are changing (at least the scoring portion). In the end, Honts and Raskin wrote the Kleiner chapter to show that any CQT employing scientifically valid principles (by that I only mean they work very well, but I wish we could all speak the same language on that one some day) is a "valid" test. Even in a Utah test you have the option of using a symptomatic question, but why would you, given there's not yet evidence that they work as planned, and they might even work against you. Both tests would be "valid," but one will likely give you consistently more accurate decisions. IP: Logged |
rnelson Member
|
posted 07-31-2006 11:39 AM
Barry,

APA was apparently the first time it was announced outside of the government that there is a new scoring system. The criteria look very much like the Utah criteria - what Don Krapohl calls "the defensible dozen," because there are only a small number of features compared with the numerous criteria defined by the previous federal and Backster systems. The Backster and federal systems were originally developed from conceptual and theoretical models. I believe the federal system was modeled after the Backster system. The Utah system was developed empirically. That there is some prominent conceptual overlap between the federal/Backster systems and the Utah/new-federal systems speaks to the accuracy of some of the initial assumptions.

That this will ultimately provide better validity is easy to imagine. It is unfortunate that it took so long for us to come up to speed with the basic principle that simpler systems offer better potential validity because they will produce greater interrater reliability. Complex systems, with human raters and many features, can be expected to produce more variability (disagreement) among scorers, meaning decreased inter-rater reliability, which will ultimately limit the degree of validity (concurrent validity - though I think this is really a kind of a-posteriori validity) attainable. Of course, computer/machine scoring systems can reliably handle very complex coding schemes.

Don Krapohl recommended the 1999 article by Bell, Raskin, Honts, and Kircher. For anyone that hasn't seen it, several people on this list have it. I posted a link some months ago. Mr. Krapohl also recommended the use of a photoplethysmograph component, in addition to the activity/countermeasure sensor. Interpretation of the photoplethysmograph is, I believe, addressed in the 1999 Utah article. r
------------------ "Gentlemen, you can't fight in here, this is the war room." --(Dr. Strangelove, 1964) IP: Logged |
Barry C Member
|
posted 07-31-2006 01:44 PM
The Bell et al. article does discuss the photoplethysmograph. The feds just started toying with them. It is a must-read for all examiners who want to improve their scoring. The only problem I have is the 2:1, 3:1, 4:1 EDA criteria, which the OSS research showed were not optimal, and Don recently published (this month) an article on that topic. The federal change has been in the works for some time, but they have so many hoops to jump through before they can make something "official" it's crazy. I think the only differences between the feds and Utah now are the ratios. By the way, as far as I know the change wasn't based on any new research; rather, it just took that long for people to agree on what we've already known for a while. It's yet another debate in the validity argument, as there is now (apparently) a new way of scoring Federal ZCTs - a method that differs from the original research studies. Does that mean if you score a Federal ZCT with the new criteria the test is not "valid"? Some would have us believe that is the case, but that's a topic for the other thread. IP: Logged |
rnelson Member
|
posted 07-31-2006 05:01 PM
Barry,

You've made a very important point here. It's come up before, but the gist of it is that we have a tendency to use the term "valid" with some variability in meaning.

Validated = empirical properties described by research, through multiple peer-reviewed studies.

Valid = construct validity - does the polygraph measure what we say it measures. Because lies and truth are physically amorphous, we are faced with the challenge of adequately defining the concepts. The best way to do this seems to be to operationalize the concepts into human activities or behaviors – 'lying' and 'truthtelling'. Because behaviors are also physically amorphous (i.e., no material substance), we get to infer the presence or absence of these activities through statistical analysis of the correlation between those behaviors and certain physiological reactions. Because those physiological reactions are normally occurring and associated with multiple human phenomena, we depend upon the aggregated correlation of multiple physiological features to determine the presence or absence of deception.

Valid = concurrent validity - do polygraph results concur with other identifiable case facts (ground truth). It's tempting to want to think about this as predictive validity, but it's not; it's a-posteriori, because the events have already occurred.

Valid = adherence to procedures - which is really an attempt to create validity in the absence of an adequate understanding of the empirical foundations of a test or method (i.e., psychology, physiology, testing principles, probability statistics). When we really don't understand the principles of the test, then our only measurement of validity is whether we followed the procedures in the correct order, the way we were taught. In this vein of thought, any deviation from procedures would invalidate a test. Certainly procedural adherence is a QC concern, but if it were the only concern then adherence to any procedure would produce a "valid" outcome according to that procedure - even voice stress. I'm not suggesting that proper procedures don't matter, only that "validity" is really a matter of empirical principles. If we could not talk about valid principles, and concerned ourselves only with established procedures as the measurement of validity, then we would be at risk of being viewed by our detractors as some form of mystified or ritualized pseudo-science.

Valid = supported by QC review - I'm not sure this is the same concern as empirical validity; it seems to be the same concern as adherence to procedure. QA reviewers are tasked with making a recommendation about whether a test or result should be "supported" as adequately compliant with established procedures and empirical principles. Just as adherence to procedures does not automatically make a test "valid," to invoke the term "invalid" may be a subtle misuse of the concept. QC programs may not so much determine whether a particular test was valid, but whether it should be "supported" as compliant and adherent to established procedures.

Valid = incremental validity - I think most of us have seen some form of test which, though somehow compromised, proved highly informative to people faced with the need to make decisions. So we are back to the concept of utility, which is probably more accurately described as "incremental validity," as that is a more recognizable term in the sciences, and does not convey the mistaken attribution of "in-valid" or "in-validity."
I'm not suggesting this is the be-all and end-all of our validity lexicon, but I do think it's time to begin thinking about using some very well-stipulated terms more carefully. I also think it's important to think carefully about how we describe our testing concepts and protocols, using language and concepts that are consistent with other sciences. For example: as PCSOT programs move toward emphasizing an evidence-based approach, we must keep pace. Just as the concept of a "full sexual history disclosure polygraph" (in which we attempt to learn everything in one test) appears to be an arcane concept, the term "utility" may also have run its course. I'm recently faced with a number of questions and arguments about test validity, and to suggest that PCSOT tests are for "utility" does not satisfy our consumers, whose decisions are influenced by the test results, as the word "utility" conveys that the tests may not be "valid." r ------------------ "Gentlemen, you can't fight in here, this is the war room." --(Dr. Strangelove, 1964) IP: Logged |
ebvan Member
|
posted 08-01-2006 10:49 AM
I may be stepping off into deep water here, but my feeble mind has a couple of questions about validity.

A) Would it be accurate to say that, based on the Honts/Amato study, which indicated that the presence of symptomatic questions appeared to negatively affect the test scores of truthful individuals, the SQs introduce two flies into the construct-validity ointment by introducing an issue that is both: #1, irrelevant to the goal of an examination designed to provide the examiner with sufficient data to render an opinion based on the physiological data concerning the relevant issue; and #2, damaging to the construct validity of the testing format because it impacts test results? It seems that standardized testing formats and adherence to procedures, at the current state of the science, are required for quality control and validity, because changing questions or adding/subtracting questions from a standardized format introduces unquantifiable variables that may or may not affect construct validity.

B) If one were conducting a validation study of Comparison Question Techniques, wouldn't it make more sense to use a stripped-down questioning format using only CQs and RQs, study what they tell us, and then use the data as a base to establish the chain of inference to build the incremental validity of formats which add Irrelevants, Sacrifice Relevants, or Symptomatics?

Don't LAUGH too hard, this was the best I could do at organizing my thoughts today. ebv IP: Logged |
rnelson Member
|
posted 08-01-2006 03:26 PM
Ebvan,

It does make sense to validate the simplest core principles first, then see if other ideas are additive. What you are really describing is validating the basic constructs first, and then determining whether things like symptomatics offer incremental validity to the concurrent/predictive/a-posteriori validity of the test results. It's interesting to note that the symptomatic (I've heard) was added because of complaints from test subjects that they didn't trust the examiners not to trick or surprise them. So the addition of the symptomatic was intended to improve the construct validity of the test - our confidence that it measures what we intend it to measure (that it's not being impacted by an outside issue). This is a good example of the myriad of theoretical concerns that underlie every assumption we make, and why it is important to validate our assumptions through research data. Otherwise, we simply do what we have always done.

Taken too far, procedural adherence has resulted in some interesting turns of history. For example, WWI was the first major conflict that employed the automatic (Maxim) machine gun, which the German military used to great effect against the technique of long lines of infantry (Napoleonic tradition) charging out of trenches (a technique developed, I believe, in the Boer War and perhaps Cuba). The interesting twist of history is that the Maxim machine gun was designed in Utah (like many other famous firearms), by Hiram Maxim, and offered to the US Army, which was uninterested and opted for the tradition of long lines of soldiers with rifles. So Maxim found an overseas customer for his invention. Eventually both sides of the conflict employed automatic machine guns, against which charging infantry were largely ineffective, and that made for a long war.

Peace, r ------------------ "Gentlemen, you can't fight in here, this is the war room." --(Dr. Strangelove, 1964) [This message has been edited by rnelson (edited 08-01-2006).] IP: Logged |
ebvan Member
|
posted 08-01-2006 04:06 PM
Actually, both the U.S. Navy and the 1st U.S. Volunteer Cavalry officially adopted the Colt Model 1895 machine gun (the Potato Digger) during the Spanish-American War, and it saw limited use in and around Cuba. By the outbreak of WWI most major powers had some sort of automatic, as opposed to hand-cranked, "machine gun." The delay in getting the machine gun to U.S. soldiers in the trenches was more probably the result of financial maneuvering for defense profits.

But to return to the main subject: I think procedural adherence is much more critical in testing where we must infer the presence or absence of truthfulness through statistical analysis of the correlation between behaviors and certain physiological reactions, because it reduces variables, than it is in testing something like a machine gun, where construct validity can be established by analyzing the outcome of loading it and pulling the trigger, thus producing easily quantifiable results, i.e., holes in the desired target. IP: Logged |
rnelson Member
|
posted 08-01-2006 04:36 PM
I love history. I've heard only a little about the Potato Digger, but I think it was a John Browning thing - like the 1911, FN Hi Power, BAR, Winchester 1892, and some interesting shotguns - also from Utah.

My point was primarily that we ought to think carefully about our methods, and not be afraid to rethink things when necessary. I couldn't agree more about the need for procedures and standards. I think your earlier point was correct: when we start trying to do too many things at once, it becomes difficult to know what is actually making the desired difference. ------------------ "Gentlemen, you can't fight in here, this is the war room." --(Dr. Strangelove, 1964) IP: Logged |
ebvan Member
|
posted 08-02-2006 03:32 PM
Machine guns are certainly a more lively topic. Nothing reinforces a conversational point quite like letting Ma Deuce and a few hundred rounds punctuate your sentences. Actually, we're both right about the 1895. The Potato Digger was co-developed by Colt and Browning (meaning Browning supplied the brains, Colt supplied the cash).

We're actually very close. My position of adherence to procedures to avoid adding variables to the construct and your position of not being afraid to rethink things when necessary fit nicely together as: adherence to procedures is important; if you choose to deviate from procedure, you should do so only after careful thought regarding the potential consequences. In other words, stick with procedures unless you know why you're not sticking with procedures. If you deviate from procedures and devise a modified-Billy-Bob-Utah-Quadri-MGQT-Bi-Zone of Tension Format, be prepared to explain it.
[This message has been edited by ebvan (edited 08-03-2006).] IP: Logged |
jrwygant Member
|
posted 08-05-2006 02:36 PM
I just stumbled across this message thread and can't resist adding a few observations.

First, weren't our test formats developed theoretically, rather than through research? We do what we do today because the guys who came before us did it that way. Back in the dinosaur era, an examiner thought that a good way to do a test would be to include certain questions and to put them in a particular order, and that became a format. Some of it derived from "politics," one examiner not liking what another examiner was doing. The federal system, by the way, was initially developed by Backster, according to him, and still represents some of the qualities of his oldest format, before he made changes in what he teaches in his own school.

Second, Backster would probably admit today that he devised symptomatic questions from a purely theoretical basis, no research. Again, there was some politics involved. His concern about "outside issues" is also reflected in his use of comparison question bars. John Reid hated both bars and symptomatics, which were contrary to his own theoretical approach.

Third, most examiners who do zone tests include symptomatic questions and then don't bother to evaluate them in any manner, so in that very limited sense the questions are worthless.

The previous suggestion that we do research on stripped-down tests - relevants, comparisons, and a few buffer neutrals at the beginning - makes a lot of sense. In the meantime, to say that a test is "invalid" because it does not include symptomatics makes no sense unless someone can show that a test WITH symptomatics produces better accuracy. There was also some interest a few years ago in whether symptomatics reduced the number of inconclusives. To my knowledge, that remains unresolved, but my own personal experience -- comparing 100+ exams with symptomatics to 100+ without -- revealed no difference in my inconclusive rate (unfortunately). [This message has been edited by jrwygant (edited 08-05-2006).] IP: Logged |
J.B. McCloughan Administrator
|
posted 08-05-2006 11:35 PM
Jim,

From what I have read you are correct in saying that most of the formats are hypothesized. Regardless of whether or not the formats have been tested through research, the theory behind the deception-based polygraph has never been fully tested and proven. Construct validity comes from this root source. Polygraph has pounded the pavement with criterion validity, but little to no attention has been given to the most important validity needed.

I know that most know this, but just for the sake of discussion here let us look at an example (a toy simulation of it follows at the bottom of this post). Say that I hypothesize that one catches fish on nights when the moon is full. So I set out to prove it. I fish every night the moon is full, and over 95% of the time I catch a fish. Now I boast that my hypothesis is correct because I did catch fish at night when the moon was full 95% of the time. But really I don't know whether the night or the moon had anything to do with my ability to catch fish 95% of the time, because I have not yet looked to see if there were other factors/variables that caused me to catch the fish.

In traditional hypothesis testing, the null hypothesis is the first thing one sets out to test, and this is where the proper use of the phrase "statistically significant" comes in. Then one must test the alternative hypothesis. But we are not alone. Even statisticians have trouble agreeing on what defines proper testing: http://www.warnercnr.colostate.edu/~anderson/PDF_files/TESTING.pdf The linked testing model might be applicable to polygraph hypothesis testing.
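To make the fishing example concrete, here is a toy simulation (the catch rates, the bait variable, and the sampling are all invented purely for illustration) of how an uncontrolled variable can produce an impressive-looking hit rate that says nothing about the hypothesized cause:

```python
import random

random.seed(42)

def caught_fish(live_bait):
    # In this toy world the catch depends only on the bait, never on the moon.
    return random.random() < (0.95 if live_bait else 0.20)

catches = []
for night in range(1000):
    full_moon = (night % 29 == 0)   # roughly one full moon per lunar month
    if not full_moon:
        continue                    # our fisherman only goes out on full-moon nights
    catches.append(caught_fish(live_bait=True))   # ...and always brings live bait

hit_rate = sum(catches) / len(catches)
print(f"Fished {len(catches)} full-moon nights; caught fish {hit_rate:.0%} of the time")
# The ~95% success rate reproduces the criterion ("I catch fish on full-moon
# nights") without telling us anything about the moon, because the bait was
# never varied and no non-full-moon nights were sampled for comparison.
```

The high hit rate is real, but only the comparison nights and the bait variable could tell us whether the moon (the construct) had anything to do with it.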
[This message has been edited by J.B. McCloughan (edited 08-05-2006).] IP: Logged |
Barry C Member
|
posted 08-06-2006 07:10 AM
I wasn't going to mention that some scientists are opposed (on scientific grounds) to the whole null-hypothesis method of proving anything, and they give some compelling reasons why.

In any event, Charles Honts wrote a book review of Furedy's (anti-polygraph) book. The review was published in Psychophysiology in May of 1993. It discusses construct and criterion validity, and I'll paste that portion here:

quote: I see this book as marred by four nearly fatal flaws that are all too common in much of the commentary on the psychophysiological detection of deception. The first of these flaws concerns the use of the term validity and, in particular, a failure to differentiate between construct validity and criterion validity (Muchinsky, 1990). Construct validity refers to the theoretical constructs that are associated with a test or technique. For example, one might ask what psychophysiological processes are being measured by a CQT or CKT. Construct validity is assessed by theoretically oriented studies aimed at showing convergence with other measures of the same construct and divergence from measures of other constructs. Establishment of construct validity is one of the most interesting and difficult processes in psychological science. Criterion validity is a much simpler notion. Criterion validity simply asks how closely the outcome of some test is associated with some known real-world criterion. The area of the psychophysiological detection of deception is poor in theory development and therefore lags behind in construct validation, which thus hampers the development of programmatic research and may help explain the hodgepodge nature of the research literature. However, in application, criterion validity is of critical importance, and construct validity is of almost no concern. Construct validation is not necessary for the application of a technique, but criterion validity is. Ben-Shakhar and Furedy confuse these issues and condemn the control question test primarily on construct validation issues. From their perspective, it seems that because construct validation is lacking, criterion validation is a moot issue. However, if construct validation were necessary for successful application, medical science would have been denied the benefits of aspirin for decades.
Notice he says that construct validity is of almost no concern as we really only care if in the end we can detect liars and truthtellers at greater than chance rates. IP: Logged |
Barry C Member
|
posted 08-06-2006 07:14 AM
I meant to add that the medication illustration is perfect. Do any of you ever read those technical novels that come with your medication? If so, how many times have you read, "Mechanism of action is unknown"? It's scary, but you'll find it on many of the pills you've popped over the years. They don't know how or why the pill fixes your problem, but it does fix it (or eliminate the symptoms), and that's what counts for most. IP: Logged |
J.B. McCloughan Administrator
|
posted 08-06-2006 10:14 AM
What I got from the Honts writing was that in the real world people don't care how something works, just that it works. He was, in an educated way, telling Furedy and Ben-Shakhar that only academics are concerned with the theoretical issues surrounding polygraph, and the rest of the world is using it successfully.

However, courts do look at the theory behind a test method for admissibility in Daubert hearings. So it depends on how, or to whom, you are presenting polygraph as to whether or not the theory matters.
IP: Logged |
rnelson Member
|
posted 08-06-2006 07:32 PM
J.B. - good article again – I hope people don't forget that null-hypothesis testing is just one form of statistical inquiry.

Jim - what you are describing as "formats are hypothesized" would be more accurately stated as "face validity," which is a form of validity – like "content validity," it is sometimes more of a starting point and involves the theoretical review of the material by subject-matter experts. We can start there, but we can't stop there.

And Barry - we've known for a while that null-hypothesis testing alone is not by itself very informative. But that does not mean it is not worthwhile – just that we need other metrics too, such as descriptives, logistic regressions, factor analysis, and a bunch of other statistics including Pearson's and Spearman's correlations, chi-square, and the point-biserial correlation. It is also important to understand the role of averages (means) and deviations – and the difference between standard deviations and standard errors (which explains why the polygraph works).

Consider the hypothesis (research question or alternative hypothesis) that males are generally taller than females. The null hypothesis in this inquiry would be that there is no significant difference. It is not hard to imagine where people might develop that hypothesis – from their observational (anecdotal) experience. Without math examples: imagine we measured a sufficiently large and representative cross-section of the population, and imagine we computed the average (mean) heights and standard deviations for males and females. From our samples, we could compute a probability statistic (standard error) to estimate, with a specified degree of confidence, the range most likely to contain the population means for male and female heights. We could also plot these means and deviations on normal probability distributions, and we might expect to see two overlapping bell-shaped curves with different mean scores. We would assume that our male and female samples may have differing variances; in other words, we cannot assume they have the same variance, because we regard them as independent samples (no individual could be part of both groups – so forget about all that freak-of-nature hermaphrodite stuff for the sake of this simplistic example).

Remember: it is the difference between those two means that pertains to our hypothesis, and we could use our standard errors to estimate the mathematical probability that these male and female groups are in fact different groups. We set an arbitrary alpha (not so arbitrary when you consider that it is generally set at .05 or .01, and that in normal probability distributions 95 percent of individual measurements will fall within about two standard deviations and 99 percent within about three standard deviations of the mean). So, if our null hypothesis is that there is no significant difference between the average height of males and females (that the two groups are actually one group with one mean score), then our null expectation is that our probability estimate will be over .05. If our estimate computes at less than .05 (or .01), then we allow ourselves to reject the null hypothesis, with the conclusion that there is a less than 5 percent probability that our male and female means (average heights) differ by chance (accident) or actually share the same mean score. In other words, the probability that they are actually different can be assumed to be over 95 percent (or 99 percent – depending on the established alpha, or what level of certainty is desired).
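To put rough numbers on that description, here is a minimal Python sketch (the heights, spreads, and sample sizes are invented for illustration, not taken from any real dataset):

```python
import math
import random

random.seed(1)

# Hypothetical samples of heights in inches (population values are made up).
males   = [random.gauss(69.0, 3.0) for _ in range(250)]
females = [random.gauss(64.0, 2.8) for _ in range(250)]

def mean(xs):
    return sum(xs) / len(xs)

def sd(xs):
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

def std_error(xs):
    # Standard error of the mean: how precisely the sample locates the population mean.
    return sd(xs) / math.sqrt(len(xs))

m_mean, f_mean = mean(males), mean(females)
diff = m_mean - f_mean

# Standard error of the difference between two independent means.
se_diff = math.sqrt(std_error(males) ** 2 + std_error(females) ** 2)

# z statistic for the null hypothesis "the two groups share one mean."
z = diff / se_diff
print(f"male mean {m_mean:.1f}, female mean {f_mean:.1f}, difference {diff:.1f} in.")
print(f"standard error of the difference {se_diff:.2f}, z = {z:.1f}")
# With alpha = .05 the two-tailed critical z is about 1.96; a z far beyond that
# lets us reject the null hypothesis of no difference between the group means.

# The group-level result says nothing about one person: a 66-inch individual
# sits within two standard deviations of BOTH group means, so the significant
# group difference cannot diagnose that individual's sex.
person = 66.0
print(f"z within male group:   {(person - m_mean) / sd(males):.2f}")
print(f"z within female group: {(person - f_mean) / sd(females):.2f}")
```

The same arithmetic, with chart scores in place of heights and truthful/deceptive groups in place of the male/female groups, is what the polygraph example further down in this post is applying.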
We could do this mathematically, or we could also investigate it more mechanically, by creating numerous samples of male and female groups – say, 30 groups of 250 males and 30 groups of 250 females. We could compute the difference between every possible male and female group pair, average that difference, then compute the standard deviation of those difference scores. We could further compute, using our sample size (N=30), a standard error for this mean difference estimate. If the error range around that mean difference were wide enough to include zero (that is, if the differences were within what chance alone would produce), then we would be unable to reject the null hypothesis – we could not say the male and female mean heights differ at a statistically significant level.

For our data to offer any meaningful conclusions, it is necessary to have large and representative samples that represent the population average with reasonable accuracy (generally that means samples of N>=30), but to establish statistical significance it is often necessary to have very large sample sizes that produce very small estimate errors (sometimes called standard errors of the estimate or standard errors of the statistic). In some data sets standard errors can be quite small – imagine our male and female samples with different estimates of the average (mean) heights for males and females. Now imagine that we have a sufficiently large sample that the standard errors of these estimates are small, with no overlap between the two (male and female) error ranges. Now also imagine the normal distribution ranges (curves) surrounding those male and female estimates, with normally (wide) standard deviation ranges that can overlap among male and female persons. In other words, some individual scores can be regarded as within the "normal range" for both male and female scores. So, in this example, an individual's height score could not be used to formulate an a-posteriori (after the fact) diagnosis of an individual's gender as either male or female. (Important: just as in polygraph, scores do not "predict" past events.) To further illustrate how meaningless this "statistical significance" can be, consider that some women will be taller than men, and that height differences may also be mediated by ethnicity and by age (at some age stratifications females are taller than males).

In polygraph, we (without realizing it) depend upon these same types of statistical models. While Honts is correct about the confusion of construct and criterion validity, the most robust sciences can address validity at the level of basic, not just "applied," science. In polygraph, many constructs are well established in terms of construct validity – blood pressure, electrodermal reactions. Other phenomena are not as well established in terms of construct validity – just google "psychological set" and you'll see we have employed the term in some ways that are inconsistent with other sciences (we generally refer to the focus of attention and reaction potential on a certain issue of concern, while other psychological researchers use the term loosely to refer to psychological attitudes). We commonly employ well-known constructs from orienting theory and conditioned response theory. Furedy's and others' criticisms of construct validity in polygraph have focused mainly on the theory of relevant and control pairs, and differences in reaction potentials among deceptive and truthful subjects.
The statistical model used above to describe the standard error of the difference between mean scores of independent groups is the same model which could establish part of the construct validity of response potential to comparison-relevant question pairs. To do so, we would expect (hypothesize) that the mean scores of truthful and deceptive persons would be different (at statistically significant levels), and that their distributions of scores would not substantially overlap (scores occurring within any overlapping region could not be confidently attributed to either the deceptive or truthful groups and would therefore be inconclusive). In addition to significant differences in the mean and deviation regions, we would expect the standard error regions for both groups to confirm a statistically significant degree of separation.

In this polygraph example we could compute the z-statistic for an individual's score and assign them with confidence to either the deceptive or truthful group, or discuss the proportion or percentage of other truthful or deceptive scores that we would expect to be higher or lower than an individual score. This is not a reliability estimate (as Identifi incorrectly attempts to call it), and it is not a probability of deception as Polyscore and Axciton incorrectly try to call it - and whoever heard of probability scores equal to or over 100 (except maybe in the Hitchhiker's Guide to the Galaxy) - 100 percent is certainty, not probability. It may be thought of as the probability that a truthful (or deceptive) person would produce a similar (higher or lower) score. Of course, our null hypothesis would be that there is no statistically significant difference between the mean scores of the truthful and deceptive groups, or that the standard error of the difference of means of those independent groups is not statistically significant.

I think it makes sense to continue developing our ability to discuss polygraph validity concerns from a number of common validity paradigms. I will most likely continue to argue that our incorporation of the concept of "utility" is fast becoming arcane (as J.B. pointed out, some persons do care about whether our explanations make sense), and we should begin using the more accurate, favorable, and accepted language of "incremental validity." r ------------------ "Gentlemen, you can't fight in here, this is the war room." --(Dr. Strangelove, 1964)
[This message has been edited by rnelson (edited 08-06-2006).] IP: Logged |
Barry C Member
|
posted 08-06-2006 08:16 PM
Ray,

You're preaching to the choir. My point is that that's just one more dog for a lawyer to throw into the fight so that nobody ever hears polygraph is able to discern the liars from truth tellers - just let two opposing parties fight over the theory(ies) behind the statistical analysis. IP: Logged |
rnelson Member
|
posted 08-07-2006 08:07 AM
Barry,

Sorry if that was too pedantic - I get that way. I'm simply concerned that people may continue to discard statistics with some cavalier arrogance that it doesn't matter. Someday, somewhere, somebody will win an important argument because we have thought this through, or not. It's the smarter folks whose questions we will eventually have to answer. To the simpletons we say "deep breathing, sweaty palms and rapid heartbeat" and "remember what it felt like when you lied to your mom..."

A couple of years ago I did a half-day workshop in which I offered a demonstration model showing that as few as 10 data points can establish a significant difference between relevant and comparison question pairs. It is not difficult to achieve 10 data points using a two- or three-relevant-question Zone technique with three to five charts and a three- or seven-position scoring system based on features from Kircher, Utah, Backster, Federal, or the "defensible dozen" features that Don Krapohl presented at the APA conference.

One of the things we might begin to think about (taking a lesson from Bill Gates and other proponents of object-oriented computer programming) is "modularity," in which we view various pieces of the polygraph test as distinct, or modular, constructs with their own validity concerns. Validated construct modules (which are really validated testing principles) become available as generalizable knowledge. Modularity allows us to investigate any change in individual construct modules while holding constant our testing procedures in the other construct modules, thereby allowing us to accelerate the rate at which we establish incremental improvements to the validity (construct, criterion, and a-posteriori) of our techniques – instead of having to re-validate in entirety every new polygraph technique or idea, and instead of invalidating or preventing the generalization of sound constructs to other techniques.

In this way, for example, we might see Zone tests as single-issue diagnostic tests in three- or two-question variants (perhaps with different RQ guidelines regarding primary, secondary, and evidence-connecting questions), alongside MGQT variants as multi-facet or mixed-issue investigative or screening tests. Construct modularity would allow us to employ or investigate a variety of chart data analysis methods, including the Utah criteria, Kircher features, Backster, Federal, or "defensible dozen," along with further modularity of three- or seven-position coding and modularity of various ratio schema, including those offered by Utah, the new federal system (not sure what to call it), and the good ol' "bigger is better" rule. A modular understanding of the validity of various testing constructs would instruct us to employ differing decision rules depending upon the testing circumstances, including: investigative decision rules (in which the primary objective is diagnostic sensitivity regarding a known incident and persons for whom there is reason to suspect involvement), screening rules (in which the primary objective is sensitivity regarding unknown concerns), and evidentiary rules (Senter rules), in which the primary testing objective may be diagnostic specificity leading to necessary action. Modularity further allows us to decide how to think about test results (DI/NDI vs SR/NSR) – with consideration for recognized testing theory, or policy (ASTM or agency policy) regarding differences in the empirical meaning of the results of screening and diagnostic tests.
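Here is a rough sketch of what that kind of modularity could look like in code. Everything in it – the feature, the ratio cutoffs, the decision thresholds – is a placeholder I made up to show the structure, not any school's official criteria:

```python
from dataclasses import dataclass
from typing import Callable, List

# Module 1: a measured question pair and one interchangeable feature construct.
@dataclass
class QuestionPair:
    relevant: float      # response amplitude at the relevant question (hypothetical units)
    comparison: float    # response amplitude at the paired comparison question

def eda_ratio(pair: QuestionPair) -> float:
    """One candidate feature: comparison-to-relevant amplitude ratio."""
    return pair.comparison / pair.relevant if pair.relevant else float("inf")

# Module 2: a numerical coding scheme, swappable without touching anything else.
def three_position(ratio: float) -> int:
    """Placeholder 3-position rule: a clearly larger comparison response scores +1."""
    if ratio >= 1.5:
        return +1
    if ratio <= 1 / 1.5:
        return -1
    return 0

# Module 3: a decision rule, where cutoffs could differ for diagnostic vs. screening use.
def decision(total: int, cutoff_pos: int = 4, cutoff_neg: int = -4) -> str:
    if total >= cutoff_pos:
        return "NSR/NDI"
    if total <= cutoff_neg:
        return "SR/DI"
    return "inconclusive"

def score_exam(pairs: List[QuestionPair],
               feature: Callable[[QuestionPair], float] = eda_ratio,
               code: Callable[[float], int] = three_position) -> str:
    # The pipeline stays fixed while any single module can be studied or replaced.
    total = sum(code(feature(p)) for p in pairs)
    return decision(total)

# Invented example data: five comparison-relevant pairings across charts.
charts = [QuestionPair(1.0, 2.2), QuestionPair(0.8, 1.9), QuestionPair(1.1, 2.5),
          QuestionPair(0.9, 1.6), QuestionPair(1.0, 2.0)]
print(score_exam(charts))   # -> "NSR/NDI" with these made-up numbers
```

Swapping in a seven-position rule, a different ratio scheme, or evidentiary cutoffs would change exactly one function, which is the sense in which each construct could be validated (or revalidated) in isolation while the rest of the technique is held constant.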
A modular or object-oriented understanding of the validated (construct validity) principles of polygraph testing also allows us to standardize our data and develop computer scoring algorithms based upon widely recognized statistical models that any graduate student in the country would recognize. That alone might do more to improve the public image of polygraph testing than any amount of PR dollars – imagine if boat-loads of well-educated professionals walked around every city in the country with the knowledge that "yep, that's exactly how psychology, physiology, testing, and statistics work," or "by gum, that polygraph stuff ain't voodoo science after all."

Peace,
r
------------------ "Gentlemen, you can't fight in here, this is the war room." --(Dr. Strangelove, 1964) IP: Logged |
jrwygant Member
|
posted 08-08-2006 11:52 AM
Well, it seems like we've kind of drifted away from the issue of symptomatic questions, so let me try to apply some of the comments about validity and reliability to field practice.

Some examiners will insist that if a zone test does not contain a symptomatic it is not a valid zone test. Some insist on two symptomatics. Yet, when those examiners evaluate their charts they give no weight to the symptomatics. I don't mean no score, I mean no weight, because examiners who have used those questions for any extended time know that responses or lack of responses on them are unreliable. I once had an exchange with one of the original developers of PolyScore, asking if PolyScore looked at activity on questions other than relevant and comparison. The answer was that it did not, but that all of its validation studies for zone tests had been done with tests that included symptomatics. If those questions were removed from the mix, there was no telling what the consequences might be on the other questions.

We face the same dilemma with regard to stim tests. Some examiners demand their inclusion in every test and insist that an examination is not complete without a stim test. Other examiners regard them as a waste of time, a potential source of confusion for the examinee, and the introduction of an element of "game playing" into an otherwise serious exchange between examiner and examinee.

We not only do not have research that supports or refutes any of these positions (use or non-use of symptomatics and stim tests), but we have examiners who insist that they must be included. Their insistence is based largely upon convention. If we know empirically that we can catch those fish on the night of the full moon without using symptomatic bait or stim-test bait, maybe we should not be so insistent on strict adherence to convention. Maybe we should instead recognize that our chart evaluations are usually based entirely on relevant and comparison questions, that the rest of the questions are pretty much ignored and serve primarily as buffers leading up to the relevant/comparison main event, and that from that point of view there's not much difference between an MGQT and a zone. Despite the protests of the PolyScore developer about a zone test that does not include symptomatics, there is only one PolyScore algorithm that evaluates both MGQT and zone tests, and the MGQT has never included symptomatics.

Okay, so now I've not only come back to symptomatics but have added stim tests, just in case you guys needed something else to consider. IP: Logged |
ebvan Member
|
posted 08-08-2006 01:53 PM
JR, let me see if I understand you correctly. For the sake of discussion you are putting forward the position that, if examiners do not evaluate the symptomatic questions in a format, they should say that they only evaluate the relationship between the CQs and RQs in their diagnosis, and that the "other questions" - irrelevants, sacrifice relevants, and symptomatics - serve only as buffers and provide structure for the test. If that is the case, the only difference between an MGQT and a ZCT would be the totaling of spot scores in the ZCT. If I am missing your point, please elaborate. IP: Logged |
dkrapohl Member
|
posted 08-08-2006 08:47 PM
Just wanted to jump in and comment that it would be nice if the entire profession could engage in these kinds of high-level discussions. We might be able to avoid some of the infighting between the graduates of the different schools, and we could all learn something about this fascinating phenomenon we exploit in polygraphy. It's good to take a look at our assumptions and dogma, and measure them against the extensive and growing body of research. All we would have to lose is our ignorance.

I did want to make one other point. There is research that indicates that a stim test (that's "acquaintance test" in governmentese) given as the first chart does improve decision accuracy. It was a federally funded project conducted by the Kircher group. John Kircher told me that the effect was slightly larger when the examiner told the examinee that the item was easy to see in the test, though there was a benefit even if there was no further conversation about it after the stim test.

There are lots of other "validated" principles in our field. The symptomatic question is not one of them. Many of Cleve's notions have withstood empirical tests (asymmetric cutoffs, a CQ before each RQ, single-issue testing), but the symptomatic question has not fared nearly so well. Think about it: if the symptomatic question did what it is supposed to do, shouldn't it be used in every format?

Keep up the good work you do out there. Don
IP: Logged |
Barry C Member
|
posted 08-09-2006 05:44 PM
Okay, here are the results of the Honts & Amato study I mentioned earlier: outside-issue questions don't do what they are intended to do (detect OIs). OIs can affect your test, but not the way people would expect: false positives (yes, false positives) went up from 8% to 46%, while false negatives went from 0 to 4%. There was no useful info in the physiological responses to the OIQs. IP: Logged |